ABSTRACT

One important and promising application of item
response theory (IRT) is computerized adaptive testing (CAT). The
implementation of a nominal response model-based CAT (NRCAT) was 
studied. Item pool characteristics for the NRCAT as well as the 
comparative performance of the NRCAT and a CAT based on the 
three-parameter logistic (3PL) model were examined. Ability estimates 
were generated at test lengths of 10, 15, 20, 25, and 30 items from 
item pools of 90 items. Abilities were generated for 1,300 examinees 
in 1 study and for 900 examinees in the other study. Results show 
that for 2-, 3-, and 4-category items, items with maximum information 
of at least 0.16 produced reasonably accurate ability estimation for 
tests with a minimum test length of about 15 to 20 items. Moreover, 
the NRCAT was able to produce ability estimates comparable to those 
of the 3PL CAT. Implications of these results were discussed. Eight
tables and six graphs illustrate the discussion. (SLD) 
ABSTRACT

One important and very promising application of item response theory (IRT) is 
computerized adaptive testing (CAT). Although most CATs use dichotomous IRT models, 
research on the use of polytomous IRT models in CAT has shown promising results. This 
study concerned the implementation of a nominal response model-based CAT (NR CAT).
Item pool characteristics for the NR CAT as well as the comparative performance of the NR 
CAT and a CAT based on the three-parameter logistic (3PL) model were examined. Results 
showed that for two-, three-, and four-category items, items with maximum information of 
at least 0.16 produced reasonably accurate ability estimation for tests with a minimum 
test length of about 15 to 20 items. Moreover, the NR CAT was able to produce ability 
estimates comparable to those of the 3PL CAT. Implications of these results were 
discussed. 
One important and very promising application of item response theory (IRT) is
computerized adaptive testing (CAT). Unlike the conventional paper-and-pencil test in 
which an examinee, regardless of ability, is administered all test items, CAT is a 
procedure for administering tests which are individually tailored for each examinee. The
advantages of IRT-based CAT over paper-and-pencil testing have been well documented
(e.g., Wainer, 1990; Weiss, 1982). 

Although not necessary (cf., De Ayala, Dodd, & Koch, 1990), a CAT system typically 
uses an IRT model in combination with test item characteristics to estimate the examinee's 
ability. Typically, either the dichotomous three-parameter logistic (3PL) or Rasch models 
(e.g., McBride & Martin, 1983; Kingsbury & Houser, 1988) have been used in CAT. These 
models do not differentiate between an examinee's incorrect answer and other incorrect 
alternatives for purposes of ability estimation. In short, dichotomous models and 
dichotomous model-based CATs operate as if an examinee either knows the correct answer 
or randomly selects an incorrect alternative. 

The operation of dichotomous model-based CATs does not incorporate findings from
human cognition studies (e.g., Brown & Burton, 1978; Brown & VanLehn, 1980; Lane, Stone,
& Hsu, 1990; Tatsuoka, 1983). For instance, Tatsuoka's (1983) analysis of student
misconceptions in performing mathematics problems showed that wrong responses could be
of more than just one kind; however, dichotomous scoring uniformly assigned a score of
zero to all the wrong responses. Moreover, it has been demonstrated by Nedelsky (1954),
from a classical test theory (CTT) perspective, and Levine and Drasgow (1983), from an IRT 
perspective, that the distribution of wrong answers over the options of multiple-choice 
items differed across ability levels. In this regard, an item's incorrect alternatives may 
augment our estimate of an examinee's ability by providing information about the 
examinee's level of understanding (i.e., provide diagnostic information). Both Bock (1972) 
and Thissen (1976) have found that for examinees with ability estimates in the lower half 
of the ability range the nominal response (NR) model provided from one third to nearly 
twice the information furnished by a dichotomously scored two-parameter model; there 
was no difference in information yield between these two models for ability estimates 
above the median θ. It should be noted that in an application to multiple-choice and
free-response items, Vale and Weiss (1977) found that the NR model provided more information
for middle ability examinees than that shown in the Bock (1972) and Thissen (1976)
studies. In CTT, the use of proper scoring techniques to assess this partial knowledge
yields increases in the reliability of multiple-choice tests (e.g., Coombs, Milholland, and
Womer, 1956). Frary (1989), Haladyna and Sympson (1988), and Wang and Stanley (1970)
all provide a review of the literature on option scoring strategies. It is obvious that the
dichotomization of the examinee's response ignores any partial knowledge that the
examinee may have of the correct answer and, as a result, this information cannot be used 
for ability estimation. 

Some research has explored the benefits and operating characteristics of CATs based 
on polytomous IRT models (e.g., Dodd, Koch, & De Ayala, 1989; Koch & Dodd, 1989; 
Sympson, 1986). Research on the use of polytomous IRT models in CAT has shown 
promising results. For instance, Sympson (1986) found that adaptive tests based on a 
polytomous model (Model 8) could be shortened by 15-20% without sacrificing test 
reliability. In addition, these studies have shown that item pools smaller than those used 
with dichotomous model-based CATs have led to satisfactory estimation, that the use of
the ability's standard error of estimation for terminating the adaptive test is preferred to 
the minimum item information termination criterion, and that the use of a variable 
stepsize instead of a fixed stepsize tends to minimize nonconvergence of trait estimation; 
the models under study were Masters's (1982) partial credit (PC), Andrich's (1978) rating 
scale (RS), and Samejima's (1969) graded response (GR) models. 

Bock's (1972) NR model is appropriate for items with unordered responses, such as 
multiple-choice aptitude and achievement test items. In addition, the NR model may be 
used with testlets (Wainer & Kiely, 1987) to solve various testing issues, such as 
multidimensionality (Thissen, Steinberg, & Mooney, 1989), with items which do not have a
"correct" response, such as demographic items (e.g., to provide ancillary information), and 
items whose alternatives provide educational diagnostic information. Moreover, innovative 
computerized item formats may be specifically developed for use with polytomous models 
and adaptive testing environments. Presently, CATs typically present simple paper-and- 
pencil item formats. 

The objectives of this study concerned the implementation of an NR model-based CAT 
(NR CAT) and were three-fold. First, because the NR model is written in terms of slope 
and intercept parameters, a form not typically used (cf., Hambleton & Swaminathan, 1985; 
Lord, 1980; Weiss, 1983), formulae for the location parameters were derived in order to 
facilitate understanding the model's formulation. In this regard, the NR model's 
relationship with the dichotomous two-parameter logistic (2PL) model was presented. 
Moreover, because of the importance of item information in CAT, the effect of varying the 
location parameters on the distribution of item information was examined. Second, 
paramount to CAT performance is the quality of the item pool. Two factors which
determine the item pool's quality are the locations of the items and their discrimination
indices. Because it is accepted that items should be evenly and equally distributed
throughout the θ continuum of interest (Patience & Reckase, 1980; Urry, 1977; Weiss,
1982) and there is no reason to believe that this would not hold for the NR model, this
factor was not studied. However, the minimum item information (i.e., the discrimination
indices' effect) which would allow reasonably accurate ability estimates by the NR CAT
was investigated. This investigation (referred to as Study 1) was limited to the 2-, 3-, and 
4-category cases. Third, the comparative performance of the NR CAT and a CAT based on a 
dichotomous (3PL) model was assessed (referred to as Study 2). Furthermore, because of
the existence of option information, an exploratory simulation was conducted in which
items were selected on the basis of option information.

Model 

The NR model assumes that item alternatives represent responses which are 
unordered. The NR model provides a direct expression for obtaining the probability of an 
examinee with ability θ responding in the j-th category of item i as:

$$P_{ij}(\theta) = \frac{\exp(c_{ij} + a_{ij}\theta)}{\sum_{h=1}^{m_i} \exp(c_{ih} + a_{ih}\theta)} \tag{1}$$

where a_ij is the slope parameter, c_ij is the intercept parameter of the nonlinear response
function associated with the j-th category of item i, and m_i is the number of categories of
item i (i.e., j = 1, 2, ..., m_i). For convenience the slope and intercept parameters are
sometimes represented in vector notation, where a = (a_i1, a_i2, ..., a_im) and c = (c_i1, c_i2, ...,
c_im), respectively. As an aid to interpreting these parameters, a logistic space plot of the
(multivariate) logit (i.e., c_ij + a_ij θ) against θ for a three-category (m = 3) item with a = (-0.75,
-0.25, 1.0) and c = (-1.5, -0.25, 1.75) is shown in Figure 1. As can be seen, the value of c_ij is
the y-intercept (i.e., the logit at θ = 0.0) and a_ij is the slope of the category's response function.
The a_ij values are analogous to and have an interpretation similar to traditional option
discrimination indices. That is, a crosstabulation of ability groups by item alternatives shows
that a category with a large a_ij reflects a response pattern in which, as one progresses from the
lower ability groups to the higher ability groups, there is a corresponding increase in the
number of persons who answer the item in that category; for categories with negative
a_ij values this pattern is reversed. Moreover, it appears that, in general, large values of c_ij are
associated with categories with large frequencies and, as the value of c_ij becomes increasingly
smaller, the frequencies for the corresponding categories decrease.
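
As an illustration, the category probabilities in (1) can be computed directly from an item's slope and intercept vectors. The following is a minimal Python sketch, not the program used in the study; the function name `nr_probabilities` is hypothetical, and the item is the three-category example plotted in Figure 1.

```python
import math

def nr_probabilities(a, c, theta):
    """Return P_ij(theta) for every category j of one item (nominal response model)."""
    logits = [c_j + a_j * theta for a_j, c_j in zip(a, c)]
    largest = max(logits)  # subtracting the largest logit avoids overflow
    weights = [math.exp(z - largest) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Three-category example item from the text (Figure 1).
a = (-0.75, -0.25, 1.0)
c = (-1.5, -0.25, 1.75)
probs_at_zero = nr_probabilities(a, c, 0.0)
```

At θ = 0 the logits reduce to the intercepts, so the categories are ordered by their c_ij values; at low θ the category with the most negative slope becomes more probable than the category with the largest slope.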



Insert Figure 1 about here 

The probability of responding in a particular category as a function of θ is depicted by
the category or option characteristic curve (OCC). Figure 2 contains the OCCs
corresponding to the three-category item presented in Figure 1.



Insert Figure 2 about here 

The intersection of the OCCs can be obtained by setting adjacent category multivariate
logits equal to one another and solving for θ. Therefore,

$$\theta = \frac{c_1 - c_2}{a_2 - a_1}. \tag{2}$$

In general, for any item with m_i > 2 and because θ and b are on the same scale:

$$b_j = \frac{c_{j-1} - c_j}{a_j - a_{j-1}}. \tag{3}$$

This formulation is analogous to that of the PC model in which step difficulties are
defined at the intersection of adjacent category characteristic curves.
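
The location parameters can be checked numerically: at each intersection the logits of the two adjacent categories are equal. The sketch below is illustrative only; `category_intersections` is a hypothetical helper name, and the item is again the three-category example from Figure 1.

```python
def category_intersections(a, c):
    """Theta values where adjacent OCCs cross: (c[j-1] - c[j]) / (a[j] - a[j-1])."""
    return [(c[j - 1] - c[j]) / (a[j] - a[j - 1]) for j in range(1, len(a))]

a = (-0.75, -0.25, 1.0)
c = (-1.5, -0.25, 1.75)
b = category_intersections(a, c)

# At the first intersection the logits of categories 1 and 2 coincide.
logit_1 = c[0] + a[0] * b[0]
logit_2 = c[1] + a[1] * b[0]
```

For this item the two intersections fall at θ = -2.5 and θ = -1.6.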

In Bock (1972) the NR model is compared with a binary version (i.e., the item consists
of correct and incorrect categories). When m_i = 2, (1) becomes

$$P_2(\theta) = \frac{\exp(c_2 + a_2\theta)}{\exp(c_1 + a_1\theta) + \exp(c_2 + a_2\theta)}. \tag{4}$$

Given (4), note that the two linear constraints imposed on the item parameters,
Σa = 0 and Σc = 0 (to address the indeterminacy of scale), imply that in the two-category
case

$$a_1 = -a_2 \quad \text{and} \tag{5}$$

$$c_1 = -c_2. \tag{6}$$

Therefore, given (5) and (6), one obtains that for m_i = 2

$$b = -\frac{c_2}{a_2}. \tag{7}$$

Solving (7) for c_2 gives c_2 = -a_2 b and, by (6), c_1 = a_2 b. Substituting these equalities into (4)
and multiplying the numerator and denominator by exp(-a_2 b),

$$P_2(\theta) = \frac{\exp(-2a_2 b + a_2\theta)}{\exp(-2a_2 b + a_2\theta) + \exp(a_1\theta)}. \tag{8}$$

By substitution of (5) into (8), and simplifying, one obtains

$$P_2(\theta) = \{1 + \exp[-2a_2(\theta - b)]\}^{-1}. \tag{9}$$

Therefore, if one casts the NR model's discrimination parameters in terms of the 2PL
model's discrimination parameter, α, and because α is typically positive,

$$\alpha = |-2a_2| = |2a_1|; \tag{10}$$

for m_i = 2 the 2PL and NR models are equivalent. For example, Figure 3 shows the NR
model's OCCs for an item with a_2 = 0.40, a_1 = -0.40, c_2 = 0.20, and c_1 = -0.20 and the item
characteristic curve (ICC) for the 2PL model with α = 0.80 and b = -0.2/0.4 = -0.5.
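
The equivalence for the two-category case can be confirmed numerically. The following sketch (function names are hypothetical) evaluates the two-category NR probability and the 2PL probability for the example item over a grid of θ values and records the largest discrepancy; `alpha` denotes the 2PL discrimination.

```python
import math

def nr_two_category_p2(a2, c2, theta):
    """P_2(theta) under the NR model with the constraints a1 = -a2 and c1 = -c2."""
    numerator = math.exp(c2 + a2 * theta)
    return numerator / (math.exp(-c2 - a2 * theta) + numerator)

def two_pl(alpha, b, theta):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - b)))

a2, c2 = 0.40, 0.20
alpha = abs(2 * a2)   # 2PL discrimination implied by the NR slopes
b = -c2 / a2          # 2PL location implied by the NR parameters

max_gap = max(
    abs(nr_two_category_p2(a2, c2, t / 10) - two_pl(alpha, b, t / 10))
    for t in range(-40, 41)
)
```

For the example item this gives α = 0.80 and b = -0.5, and the two curves agree to within floating-point error.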

Insert Figure 3 about here



Information

For the NR model, the item information, I_i(θ), is equal to the sum of the option
informations, where option information may be defined as (Bock, 1972)

$$I_{ij}(\theta) = \mathbf{a} W \mathbf{a}' P_{ij}(\theta), \tag{11}$$

and item information is

$$I_i(\theta) = \sum_{h=1}^{m_i} \mathbf{a} W \mathbf{a}' P_{ih}(\theta) = \mathbf{a} W \mathbf{a}', \tag{12}$$

where, for a given item i,

$$W = \begin{bmatrix}
P_1(1 - P_1) & -P_1 P_2 & \cdots & -P_1 P_m \\
-P_2 P_1 & P_2(1 - P_2) & \cdots & -P_2 P_m \\
\vdots & \vdots & \ddots & \vdots \\
-P_m P_1 & -P_m P_2 & \cdots & P_m(1 - P_m)
\end{bmatrix}.$$

For the m_i = 2 case, the location of maximum item information (Imax) is

$$\theta_{max} = \frac{c_1 - c_2}{a_2 - a_1}$$

with Imax = 0.25(a_2 - a_1)^2. Due to the number of unknowns, a formula for the location of
maximum item information cannot be determined for m_i > 2. When m_i = 2 and for a given a,
changing the values of c forces the location of Imax to shift along the θ continuum, but the
maximum amount of information remains constant.
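
The quadratic form in (12) can be evaluated term by term. The sketch below is a hedged illustration rather than the study's program: it builds each element of W from the category probabilities and checks the two-category closed forms for θmax and Imax, using the two-category example item from the 2PL comparison above. Function names are hypothetical.

```python
import math

def nr_probabilities(a, c, theta):
    weights = [math.exp(c_j + a_j * theta) for a_j, c_j in zip(a, c)]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(a, c, theta):
    """Item information a W a', with W[h][k] = P_h(1 - P_h) if h == k else -P_h * P_k."""
    p = nr_probabilities(a, c, theta)
    info = 0.0
    for h in range(len(a)):
        for k in range(len(a)):
            w_hk = p[h] * (1.0 - p[h]) if h == k else -p[h] * p[k]
            info += a[h] * w_hk * a[k]
    return info

# Two-category item: a = (a_1, a_2), c = (c_1, c_2).
a, c = (-0.40, 0.40), (-0.20, 0.20)
theta_max = (c[0] - c[1]) / (a[1] - a[0])
i_max = item_information(a, c, theta_max)
```

For this item the maximum of 0.16 occurs at θ = -0.5, where both categories are equally likely.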

For the m_i = 3 case and for a given a, if the bs are in ascending order, then the item
information function becomes comparatively more leptokurtic as the difference between bs
becomes less extreme. When the bs are in descending order, then the item information
function becomes comparatively more platykurtic as the difference between bs becomes
less extreme. In both cases there is also a shifting in the location of Imax.

For the m_i = 4 case and for a given a, if the bs are in ascending order, then the item
information function becomes comparatively more platykurtic as the difference between
bs becomes less extreme. This pattern holds if one reverses the last two bs. When the bs
are in descending order, then relative to the item information function when the bs are in
ascending order, the function becomes more leptokurtic as the difference between bs
becomes less extreme. This is also true if one transposes the first two bs. For the other
two possible b patterns, the information function becomes comparatively more leptokurtic
as the distance among the bs decreases. Moreover, it is possible to obtain bimodal item
information functions. For instance, Figure 4 contains the information function for an
item where a = (1, 0.1, -0.1, -1) and c = (0.1, 2.4, -2.6, 0.1).

Insert Figure 4 about here

As (12) implies, item information is a function of the magnitude of the elements of a and
the order of the elements of a (i.e., for a given c, a = (-0.25, 1.0, -0.75), a = (-0.25, -0.75,
1.0), and a = (-0.75, -0.25, 1.0) will produce three different Imax values at three different
θmax values). For a given a the signs of the elements do not affect Imax as long as Σa = 0
(and Σc = 0). For instance, given two items with the same c (e.g., c = (0.25, -0.15, -0.1)) but
a vectors which differ only in the sign of the elements, such as a = (0.4, 0.25, -0.65) and
a = (-0.4, -0.25, 0.65), the items will have the same Imax = 0.245 but at different θmax values;
specifically, θmax = 0.83985 for a = (-0.4, -0.25, 0.65) and θmax = -0.83985 for
a = (0.4, 0.25, -0.65). This is also true in the four-category case. Given the same c, two items
whose a vectors differ only in the sign of the elements (and satisfy Σa = 0), such as
a = (0.55, 0.4, -0.35, -0.6) and a = (-0.55, -0.4, 0.35, 0.6), will yield Imax = 0.258679 at
θmax = 0.059 and θmax = -0.059, respectively.
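
Because no closed form for θmax exists when m_i > 2, the maximum can be located numerically. The sketch below uses a simple grid search, which is an assumption made for illustration (the text does not say how θmax was found), and uses the expanded form of the quadratic form, Σ a_h² P_h - (Σ a_h P_h)², which is algebraically identical to a W a'. Function names, the search range, and the grid density are hypothetical.

```python
import math

def item_information(a, c, theta):
    """Expanded form of a W a': sum(a_h^2 P_h) - (sum(a_h P_h))^2."""
    weights = [math.exp(c_j + a_j * theta) for a_j, c_j in zip(a, c)]
    total = sum(weights)
    p = [w / total for w in weights]
    mean_a = sum(a_j * p_j for a_j, p_j in zip(a, p))
    return sum(a_j * a_j * p_j for a_j, p_j in zip(a, p)) - mean_a ** 2

def locate_maximum(a, c, lower=-4.0, upper=4.0, steps=8000):
    """Grid search for the theta that maximizes item information."""
    grid = [lower + (upper - lower) * k / steps for k in range(steps + 1)]
    best = max(grid, key=lambda t: item_information(a, c, t))
    return best, item_information(a, c, best)

# Three-category example from the text: same c, slopes of opposite sign.
c = (0.25, -0.15, -0.10)
theta_pos, imax_pos = locate_maximum((-0.4, -0.25, 0.65), c)
theta_neg, imax_neg = locate_maximum((0.4, 0.25, -0.65), c)
```

With a grid spacing of 0.001 the search reproduces the reported values to about three decimal places: the same Imax for both items, at θmax values that are mirror images of one another.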

METHOD

Study 1: Determination of Minimum Item Information for use in NR CAT 
Programs: A program for performing adaptive testing with the NR model was written (NR 
CAT). The program used expected a posteriori (EAP) estimation (Bock & Mislevy, 1982) of 
ability and item selection was on the basis of information. The adaptive testing 
simulation was terminated when a maximum of thirty items was reached. Ability
estimates at test lengths of 10, 15, 20, 25 and 30 items were recorded. The initial ability 
estimate for an examinee was the population's mean and a uniform prior with ten 
quadrature points was used. An additional program for generating the data according to 
the NR model was written and is discussed below. 
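
EAP estimation with a uniform prior reduces to a likelihood-weighted average over the quadrature points. The following is a minimal sketch and not the NR CAT program itself; the function names are hypothetical and, because the text does not give the quadrature range, the ten equally spaced points on [-4, 4] are an assumption.

```python
import math

def nr_probabilities(a, c, theta):
    weights = [math.exp(c_j + a_j * theta) for a_j, c_j in zip(a, c)]
    total = sum(weights)
    return [w / total for w in weights]

def eap_estimate(items, responses, nodes):
    """EAP ability estimate and posterior SD under a uniform prior over `nodes`.

    items: list of (a, c) vectors; responses: 0-based category chosen on each item.
    """
    posterior = []
    for theta in nodes:
        likelihood = 1.0
        for (a, c), x in zip(items, responses):
            likelihood *= nr_probabilities(a, c, theta)[x]
        posterior.append(likelihood)  # equal prior weights cancel on normalization
    total = sum(posterior)
    mean = sum(t * w for t, w in zip(nodes, posterior)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(nodes, posterior)) / total
    return mean, math.sqrt(variance)

# Assumption: ten equally spaced quadrature points on [-4, 4].
nodes = [-4.0 + 8.0 * k / 9 for k in range(10)]
item = ((-0.40, 0.40), (-0.20, 0.20))
start, _ = eap_estimate([], [], nodes)
after_high, posterior_sd = eap_estimate([item] * 3, [1, 1, 1], nodes)
```

Before any item is administered the estimate equals the mean of the prior (0 for this symmetric grid); three responses in the category with the positive slope move the estimate upward.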

Data: A series of item pools was created. The item pools differed from one another on the
basis of two factors: maximum item information, Imax, and number of item alternatives, m_i =
2, 3, and 4 options. The item pool size was 90 items (cf., Dodd, Koch, & De Ayala, 1989; Koch
& Dodd, 1989).

Although Urry's (1977) guidelines for the discrimination parameter were stated in
terms of α's magnitude, the importance of an item's α value is its effect on Imax. Because,
when the number of categories is three or more, different combinations of a and c can
produce the same Imax value, establishing guidelines in terms of the magnitude of the elements
of these vectors was not pursued. Rather, specified values for Imax were set a priori and
the a vector to obtain a specific Imax was determined. The Imax values studied were 0.25,
0.16, 0.09, and 0.04.

When m_i = 2, the a vectors may be specified a priori. For the Imax values of 0.25, 0.16,
0.09, and 0.04, the corresponding a vectors were (0.50, -0.50), (0.40, -0.40), (0.30, -0.30), and
(0.20, -0.20), respectively. (For the 2PL model these a vectors are equivalent to α values of
1.0, 0.8, 0.6, and 0.4, respectively.) Because Urry (1977) has recommended the use of items
with α > 0.80 in CAT, for the m_i = 2 condition the use of a = (0.40, -0.40) was expected to be
equivalent to the use of α = 0.80 with a 2PL model-based CAT. For each Imax level of the
m_i = 3 and m_i = 4 conditions the a vectors for the items were chosen through a trial-and-error
procedure to approximate the relevant Imax value.

A number of researchers have stated that the item bs should be evenly distributed
throughout the θ range of interest (e.g., Patience & Reckase, 1980; Urry, 1977; Weiss, 1982).
Therefore, item bs were distributed at nine scale points between -4.0 and 4.0 in increments of
1 logit (i.e., for item 1 b = -4.0, for item 2 b = -3.0, etc.); for the m_i > 2 conditions the average
location for an item was set at one of the nine scale points.

Once the a vector for a given Imax level was determined, then the c vector to locate the
items, in terms of its b (for m_i = 2) or average b (for m_i > 2), at the specified scale points
could be calculated. Therefore, these item sets consisted of 9 items with a constant
maximum information which were distributed to encompass the examinee ability range.
These 9 items were replicated to produce a 90-item pool for each of the 12 combinations of
the 4 Imax levels crossed by the 3 m_i levels. De Ayala, Dodd, & Koch (1990) found that
multiple items with the same parameters were administered to an examinee as the CAT
estimation algorithm approached its final ability estimate.
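
For the two-category condition the construction of a pool can be sketched in a few lines. The code below is illustrative only and covers just m_i = 2, where the constraint c_1 = -c_2 and the location b = (c_1 - c_2)/(a_2 - a_1) determine c uniquely; the helper names are hypothetical, and the procedure for the three- and four-category pools (which used an average b) is not reproduced here.

```python
def two_category_item(a, b):
    """Return (a, c) for a two-category item located at b, with c_1 = -c_2."""
    a1, a2 = a
    c1 = b * (a2 - a1) / 2.0
    return a, (c1, -c1)

def build_pool(a, locations, replications):
    """Nine-location item set replicated to form the full pool."""
    item_set = [two_category_item(a, b) for b in locations]
    return item_set * replications

locations = [float(b) for b in range(-4, 5)]     # -4.0, -3.0, ..., 4.0
pool = build_pool((0.40, -0.40), locations, 10)  # Imax = 0.16 condition

# Recover the location of the first item from its parameters.
first_a, first_c = pool[0]
first_b = (first_c[0] - first_c[1]) / (first_a[1] - first_a[0])
```

This yields 90 items, ten at each of the nine scale points, all with the same maximum information.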

Thirteen hundred examinees' abilities were generated to be evenly distributed
between -3.0 and 3.0 using a one-half logit interval between successive θ levels (i.e., for
100 examinees θ = -3.0, for 100 examinees θ = -2.5, etc.). These true θs (θ_Ts) plus the 90
item parameters for each condition were used to generate polytomous response strings
with a random error component for each simulated examinee (i.e., 12 response data sets
were created). Generation of an examinee's polytomous response string was accomplished
by calculating the probability of responding to each alternative of an item according to the
NR model. Based on the probability for each alternative, cumulative probabilities were
obtained for each alternative. A random error component was incorporated into each
response by selecting a random number from a uniform distribution [0, 1] and comparing it
to the cumulative probabilities. The ordinal position of the first cumulative probability
which was greater than the random number was taken as the examinee's response to the
item.
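
The response-generation step described above can be sketched as follows. This is an illustration under stated assumptions, not the original data generation program: the function names and the seed are hypothetical, and the item is the three-category example from Figure 1.

```python
import math
import random

def nr_probabilities(a, c, theta):
    weights = [math.exp(c_j + a_j * theta) for a_j, c_j in zip(a, c)]
    total = sum(weights)
    return [w / total for w in weights]

def simulate_response(a, c, theta, rng):
    """Return the 1-based category whose cumulative probability first exceeds u."""
    u = rng.random()  # uniform random number on [0, 1)
    cumulative = 0.0
    probabilities = nr_probabilities(a, c, theta)
    for category, p in enumerate(probabilities, start=1):
        cumulative += p
        if u < cumulative:
            return category
    return len(probabilities)  # guard against floating-point rounding

rng = random.Random(12345)  # hypothetical seed, for reproducibility only
a, c = (-0.75, -0.25, 1.0), (-1.5, -0.25, 1.75)
responses = [simulate_response(a, c, 3.0, rng) for _ in range(1000)]
share_top = responses.count(3) / len(responses)
```

At θ = 3.0 the third category has a model probability above 0.99, so nearly all simulated responses fall in that category.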

Analysis: The focus of Study 1 was to determine the minimum Imax value which would result 
in a significant improvement in the estimation of ability. The accuracy of ability estimation 
was assessed by root mean square error (RMSE) and Bias. RMSE and Bias were calculated 
according to: 

$$\mathrm{RMSE}(\theta) = \sqrt{\frac{\sum_{k=1}^{n_f} (\hat{\theta}_k - \theta_T)^2}{n_f}}, \tag{13}$$

$$\mathrm{Bias}(\theta) = \frac{\sum_{k=1}^{n_f} (\hat{\theta}_k - \theta_T)}{n_f}, \tag{14}$$

where θ̂_k is the ability estimate for examinee k with latent ability θ_T, and n_f is the number
of examinees at interval f (i.e., n_f = 100).
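
The two accuracy indices can be computed for one ability interval as in the sketch below. It assumes that error is defined as the estimate minus the true value; the function name and the four estimates are hypothetical and chosen only to show how positive and negative errors cancel in Bias but not in RMSE.

```python
import math

def rmse_and_bias(estimates, true_theta):
    """RMSE and Bias for the examinees at one true-theta interval."""
    errors = [estimate - true_theta for estimate in estimates]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n
    return rmse, bias

# Hypothetical estimates for four examinees whose true ability is 1.0.
rmse, bias = rmse_and_bias([0.9, 1.1, 1.2, 0.8], 1.0)
```

Here the errors are symmetric about zero, so Bias is 0 while RMSE is about 0.158.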

The analyses of the 2-, 3-, and 4-category cases were treated separately. Therefore,
the basic design was a one-group repeated measures design with two dependent variables,
RMSE and Bias, with Imax as the between subjects factor and test length as the within subjects
factor. The test length factor was included because the accuracy of ability estimation is
influenced by both the adaptive test length and the information content of the items
administered. Because the Bonferroni method was used to control for familywise Type I
error, α was set at 0.008 (i.e., 0.05/6). Post hoc analysis was performed with the Scheffé
test using a critical F of 13.2595 (= (k - 1)F(0.008; 3, 48), with k = 4 Imax levels).
Descriptive statistics on the adaptive tests were calculated.

Study 2: Comparative performance of the NR and 3PL CATs 

Programs: The NR CAT program from Study 1 was used in Study 2; the NR CAT could select
items on the basis of either item or option information. An additional CAT program based
on the 3PL model (3PL CAT) was written. The 3PL CAT program estimated ability through
EAP and selected items on the basis of information. The adaptive testing simulation was
terminated when either of two criteria was met: a maximum of thirty items was reached or
a predetermined standard error of estimate (SEE) was obtained (SEE termination
criteria of 0.20, 0.25, and 0.30 were used). The initial ability estimate for an examinee was
the population's mean. Both CATs used a ten-point uniform prior distribution.
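
The variable-length procedure (maximum-information item selection, EAP updating, and termination on either the SEE criterion or thirty items) can be sketched end to end for a single simulated examinee. Everything named here is an assumption made for illustration: the function names, the quadrature grid on [-4, 4], the use of the posterior standard deviation as the SEE, the seed, and the two-category pool; it is not the NR CAT or 3PL CAT program.

```python
import math
import random

def probs(item, theta):
    a, c = item
    weights = [math.exp(c_j + a_j * theta) for a_j, c_j in zip(a, c)]
    total = sum(weights)
    return [w / total for w in weights]

def information(item, theta):
    # Expanded a W a': sum(a_h^2 P_h) - (sum(a_h P_h))^2
    p = probs(item, theta)
    mean_a = sum(a_h * p_h for a_h, p_h in zip(item[0], p))
    return sum(a_h * a_h * p_h for a_h, p_h in zip(item[0], p)) - mean_a ** 2

def eap(administered, responses, nodes):
    post = []
    for t in nodes:
        like = 1.0
        for item, x in zip(administered, responses):
            like *= probs(item, t)[x]
        post.append(like)
    total = sum(post)
    mean = sum(t * w for t, w in zip(nodes, post)) / total
    sd = math.sqrt(sum((t - mean) ** 2 * w for t, w in zip(nodes, post)) / total)
    return mean, sd  # posterior SD is used here as the SEE

def run_cat(pool, true_theta, see_target, rng, max_items=30):
    nodes = [-4.0 + 8.0 * k / 9 for k in range(10)]  # assumed ten-point grid
    available = list(range(len(pool)))
    administered, responses, used = [], [], []
    estimate, see = eap(administered, responses, nodes)
    while len(used) < max_items and see > see_target and available:
        # Select the unused item with the most information at the current estimate.
        best = max(available, key=lambda i: information(pool[i], estimate))
        available.remove(best)
        used.append(best)
        # Simulate the response from the model probabilities at the true theta.
        u, cumulative, x = rng.random(), 0.0, len(pool[best][0]) - 1
        for j, p in enumerate(probs(pool[best], true_theta)):
            cumulative += p
            if u < cumulative:
                x = j
                break
        administered.append(pool[best])
        responses.append(x)
        estimate, see = eap(administered, responses, nodes)
    return estimate, see, used

# Hypothetical pool: two-category items with a = (0.5, -0.5) located at b = -4, ..., 4.
pool = [((0.5, -0.5), (-0.5 * b, 0.5 * b)) for b in range(-4, 5)] * 10
estimate, see, used = run_cat(pool, 1.0, 0.30, random.Random(7))
```

The loop stops as soon as the SEE falls to the target or thirty items have been given, whichever comes first, which is why a more informative pool shortens the test rather than lowering the final error.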

A data generation program based on a linear factor analytic model (Wherry, Naylor,
Wherry, & Fallis, 1965) was written and is discussed below. The linear factor analytic
approach for generating the data was used to minimize any bias in favor of either the 3PL
or NR model; this procedure has been used previously (De Ayala, Dodd, & Koch, in press;
Dodd, 1984; Koch, 1981; Reckase, 1979).

Calibration: MULTILOG (Thissen, 1988) was used to obtain item parameter estimates for
the NR and 3PL models using default program parameters. 

Data: Thirteen hundred examinees' abilities were generated to be evenly distributed
between -3.0 and 3.0 using a one-half logit interval between successive θ levels. The
examinees' responses to 150 4-alternative items were generated according to the linear
factor analytic model:

$$z_{ki} = a_i \theta_{Tk} + \sqrt{1 - h_i^2}\, z_{e_{ki}}, \tag{15}$$

where θ_Tk was examinee k's latent ability, a_i was item i's factor loading, h_i^2 was item i's
communality, and z_eki was a random number generated from a N(0,1) distribution to be
the error component of examinee k and item i. All factor loadings were uniformly high
and ranged from 0.62 to 0.84. Subsequent to the calculation of z_ki, z_ki was compared to
pre-specified category boundaries to determine the category response for examinee k to
item i.
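
The generation of a single response under the linear factor analytic model can be sketched as follows. Several details are assumptions, since the text does not report them: the category boundaries, the seed, the loading of 0.75 (chosen from within the reported 0.62 to 0.84 range), and setting the communality equal to the squared loading, as would hold with a single common factor. The function name is hypothetical.

```python
import bisect
import math
import random

def generate_category(theta, loading, boundaries, rng):
    """Draw one 1-based category response from the linear factor analytic model."""
    communality = loading ** 2  # assumption: one common factor, so h^2 = a^2
    z = loading * theta + math.sqrt(1.0 - communality) * rng.gauss(0.0, 1.0)
    # The response is the interval of the sorted boundaries into which z falls.
    return bisect.bisect_right(boundaries, z) + 1

rng = random.Random(2024)       # hypothetical seed
boundaries = [-1.0, 0.0, 1.0]   # hypothetical cut points for a 4-alternative item
low = [generate_category(-2.0, 0.75, boundaries, rng) for _ in range(2000)]
high = [generate_category(2.0, 0.75, boundaries, rng) for _ in range(2000)]
```

Because the responses do not come from either IRT model, neither the NR nor the 3PL calibration is favored by the generating mechanism.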

These data were submitted to MULTILOG to obtain item parameter estimates for both
the NR and 3PL models. Given the results of Study 1, item pools for the NR and the 3PL
CATs were constructed by identifying items with values of Imax ≥ 0.16 and whose θmax
values were evenly distributed throughout the -2.0 to 2.0 ability range. These items were
replicated to produce item pools of 152 items.

Analysis: The focus of Study 2 was to determine whether there were any psychometric
advantages to be achieved by using the polytomous NR model as opposed to the dichotomous
3PL model. The quality of the ability estimation provided by the two CATs was analyzed by
calculating RMSE and Bias. Moreover, the number of items administered (NIA) in obtaining
θ̂ was also used for comparing the two types of CATs. The design was a one-group repeated
measures design with three dependent variables: RMSE, Bias, and NIA; type of CAT (NR, 3PL)
was the between subjects factor and SEE termination criterion (0.20, 0.25, 0.30) was the
repeated measures or within subjects factor. Because the Bonferroni method was used to
control for familywise Type I error, α was set at 0.0056. Post hoc analysis was performed
with the Scheffé test using a critical F of 10.223 (= (k - 1)F(0.0056; 1, 16), with k = 2 types
of CAT).

Because of the item pool characteristics only examinees with -2.0 ≤ θ_T ≤ 2.0 were used
in the CATs. For each of these 900 examinees an adaptive test was simulated using the NR
and 3PL CATs, the relevant item pool, and SEE termination criterion. Descriptive
statistics on the adaptive tests were calculated.

RESULTS 

Study 1 

Table 1 contains descriptive statistics on the NR adaptive tests. As would be
expected, there was a direct relationship between the fidelity coefficient, r(θ̂, θ_T), and Imax
as well as between r(θ̂, θ_T) and test length. For Imax = 0.25 there was a slight increase in
r(θ̂, θ_T) as the number of categories increased for a given test length: 10, 15, 20, or 25 items;
this increase in r(θ̂, θ_T) tended to diminish with increasing test length.

Insert Table 1 about here 

The repeated measures analyses are presented in Table 2. As can be seen for the
two-category condition (Table 2a), the average RMSE improved significantly as both test length
and Imax increased. Post hoc analysis of the Imax factor showed that for the two-category
case there was a significant reduction in RMSE as Imax increased from 0.04 to 0.09 to 0.16
for tests of 15, 20, 25, and 30 items in length. Increasing the item information content
from 0.16 to 0.25 did not produce a significant improvement in ability estimation as
assessed by RMSE. For the 10-item test there was, in addition to the above finding, a
significant improvement in accuracy of estimation from 0.16 to 0.25. That is, for the
shorter test length of 10 items more informative items were needed than at longer test
lengths.

Insert Table 2 about here 

For all Imax values there was a significant improvement in the accuracy of estimation
as tests increased in length from 10 to 15 to 20 items. As would be expected, at higher
item information levels (e.g., 0.16 and 0.25) increasing the length of the tests from 20 to
25 items or from 25 to 30 items did not yield a significant reduction in RMSE; for Imax =
0.09 estimation accuracy was significantly improved by increasing the test length from 20
to 25 items, but not from 25 to 30 items. In short, it appears that the use of items with
Imax ≥ 0.16 (i.e., α ≥ 0.80) provides reasonable ability estimation for tests of 20 (possibly
15) or more items. With shorter length tests more informative items are required than at
longer test lengths. Test length and Imax did not have a significant effect on Bias. This
is, in part, a function of the way Bias is calculated and the potential for cancellation of
negative Bias by positive Bias. Figure 5 contains RMSE and Bias plots for selected NR
CATs; these plots are typical of all the NR CAT plots.

Insert Figure 5 about here 



For the three-category condition (Table 2b) and test lengths of 20 or more items the
results were similar to the two-category condition. That is, there was a significant
reduction in RMSE as Imax increased from 0.04 to 0.09 to 0.16, but not from 0.16 to 0.25.
However, for the 10- and 15-item test lengths the results were the reverse of those of the
two-category condition. In general, results for the four-category condition (Table 2c) parallel
those of the two- and three-category conditions. That is, there was a significant reduction
in RMSE as Imax increased from 0.04 to 0.09 to 0.16 to 0.25 for tests of 20 or fewer items.
There was no significant reduction in RMSE as Imax increased from 0.16 to 0.25 for tests
of 25 or 30 items.

Study 2

Table 3 contains descriptive statistics on the NR and 3PL adaptive tests. The results
for the NR and 3PL CATs tended to be comparable with the only meaningful difference in
r(θ̂, θ_T) appearing at a termination SEE of 0.30. However, the NR CAT tended to administer
adaptive tests which, on average, were shorter than those of the 3PL CAT.

Insert Table 3 about here 



Table 4 contains the source tables for the repeated measures analysis. With respect to 
RMSE and Bias there were no significant differences between the 3PL and NR CATs. 
Although the NR CAT did administer, on average, fewer items than did the 3PL CAT to 
achieve the same accuracy in estimation, this difference was not significant using the 
Bonferroni criterion. That is, the ability estimation of the NR CAT was comparable to that
of the 3PL CAT. 



Insert Table 4 about here



Because with a polytomous model item information is the sum of the information
functions for individual responses (a.k.a. category or option information functions), an
exploratory study selecting items on the basis of category information was conducted (i.e.,
which item provided the maximum information for the particular alternative chosen by the
examinee). It was believed that selecting items on the basis of category information would
be more consistent with the concept of polytomous scoring of examinee responses than
selecting items on the basis of item information, which ignores which particular response
an examinee provided. (Of course, the likelihood function is a function of an examinee's
particular responses.) This exploratory study used the same simulated data and programs
as Study 2, except that items were selected on the basis of category information rather
than on the basis of item information. These results are provided in Table 5 and, as can be
seen, parallel those presented in Table 4. Specifically, the NR CAT which selected items
on the basis of category information provided ability estimation which, in terms of RMSE
and Bias, was comparable to that of the 3PL CAT. However, unlike the NR CAT results
presented previously, selecting items on the basis of category information did result in
the NR CAT administering significantly shorter tests, on average, than did the 3PL CAT
for all SEE termination conditions. The post hoc comparison Fs for NIA were all
significant at an overall α = 0.05 and were 12.074, 16.225, and 11.357 for the SEE
termination criteria of 0.20, 0.25, and 0.30, respectively. As can be seen from Table 6, despite
this reduction in test length the NR CAT yielded fidelity coefficients comparable to those
of the 3PL CAT.



Insert Tables 5 and 6 about here

DISCUSSION

In general, the distribution of information was affected by the distance between the
item's bs, whether the bs were in order, and the number of item alternatives. Study 1
showed that for two-, three-, and four-category items, items with an Imax value of at least 
0.16 produced reasonably accurate ability estimation for test lengths of 15 or more items. 
Shorter length tests required more informative items to maintain reasonable ability 
estimation. 

Results from Study 2 seemed to indicate that the NR CAT was able to produce ability 
estimates comparable to those of the 3PL CAT. To achieve the same level of accuracy (e.g.,
SEE = 0.20) the NR CAT administered fewer items, on average, than did the 3PL CAT (e.g.,
12.393 versus 16.191, respectively). Although this latter result was nonsignificant, some
practitioners may still consider it meaningful because in an implementation the adaptive
test administered under the NR model would be shorter than it is under the 3PL model.
However, a plot of the difference in average NIA between the NR and 3PL CATs versus θ
showed that the NR CAT administered substantially fewer items, on average, primarily for
examinees with θ < -1.0 (see Figure 6). A relative efficiency comparison of the
information content of the item pools of the NR and 3PL CATs showed that although the NR
model provided slightly more information than did the 3PL model throughout the ability
range, the NR model began to provide substantially more information than the 3PL model
below θ = -1.0. Past experience with dichotomous models has shown that item pools which
are more informative for the ability range below -1.0 than existed in the present study 
can be constructed. Therefore, practitioners should not consider the NR CAT's shorter 
average test lengths to necessarily be meaningful. This interpretation is also appropriate 
for the significant NIA results when category information was used for selecting items for 
the NR CAT. 
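
A relative efficiency comparison of this kind can be sketched as the ratio of two pool information functions evaluated over θ. The block below is a minimal illustration under stated assumptions: the standard 3PL item information formula, a scaling constant D = 1.7 (whether the study used it is not known from this section), Bock's nominal response model for the NR items, and two small hypothetical pools that stand in for the study's 90-item pools.

```python
import math

D = 1.7  # scaling constant; whether the study used it is an assumption

def info_3pl(theta, a, b, c):
    """Item information for the three-parameter logistic model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def info_nr(theta, a, c):
    """Item information for the nominal response model."""
    z = [ak * theta + ck for ak, ck in zip(a, c)]
    top = max(z)
    w = [math.exp(v - top) for v in z]
    total = sum(w)
    p = [v / total for v in w]
    mean_a = sum(ak * pk for ak, pk in zip(a, p))
    return sum(ak * ak * pk for ak, pk in zip(a, p)) - mean_a ** 2

def relative_efficiency(theta, nr_pool, tpl_pool):
    """Ratio of NR pool information to 3PL pool information at theta."""
    nr_total = sum(info_nr(theta, a, c) for a, c in nr_pool)
    tpl_total = sum(info_3pl(theta, a, b, c) for a, b, c in tpl_pool)
    return nr_total / tpl_total

# Hypothetical two-item pools, chosen only to make the sketch runnable.
nr_pool = [([-1.0, 0.0, 1.0], [0.0, 0.5, 0.0]),
           ([-0.8, -0.2, 1.0], [-0.5, 0.3, 0.2])]
tpl_pool = [(1.0, 0.0, 0.2), (0.8, 0.5, 0.2)]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(relative_efficiency(theta, nr_pool, tpl_pool), 3))
```

Ratios above 1 at a given θ mean the NR pool is the more informative one there. Because the ratio depends on both pools, a more informative dichotomous pool at low θ would pull it back toward 1.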

Insert Figure 6 about here 

It appears that an NR model-based CAT can provide ability estimation comparable to that of a 
dichotomous model-based CAT. The NR CAT did not provide more accurate θ̂ for examinees 
with θ < 0.0, relative to the 3PL CAT, because a variable test length was used. That is, the 
additional information provided by the NR model over a dichotomous model for the lower half 
of the ability distribution resulted in the adaptive test terminating sooner than it would 
with the dichotomous model. For a given (reasonable) fixed-length test, one would expect 
that the NR CAT would provide more accurate θ̂ for examinees with θ < 0.0 than would a 
dichotomous model. 
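
The trade-off between test length and accuracy under a SEE stopping rule can be sketched as follows. The block assumes that SEE is approximated by the reciprocal square root of the accumulated item information (the study's actual estimator of SEE may differ). The per-item information values are hypothetical and only illustrate a pool that is more informative at a low ability level.

```python
import math

def items_until_see(item_infos, see_target):
    """Count items administered until 1/sqrt(accumulated information) <= see_target.

    Returns None if the target is not reached with the supplied items.
    """
    total = 0.0
    for n, info in enumerate(item_infos, start=1):
        total += info
        if 1.0 / math.sqrt(total) <= see_target:
            return n
    return None

def see_at_fixed_length(item_infos, length):
    """SEE after a fixed number of items under the same approximation."""
    return 1.0 / math.sqrt(sum(item_infos[:length]))

# Hypothetical per-item information at a low ability level: the second pool is
# assumed to be more informative there, as described for the NR model.
less_informative = [1.2] * 30
more_informative = [1.7] * 30

print(items_until_see(less_informative, 0.25), items_until_see(more_informative, 0.25))
print(round(see_at_fixed_length(less_informative, 10), 3),
      round(see_at_fixed_length(more_informative, 10), 3))
```

With a variable-length rule the extra information shows up as a shorter test at the same SEE. With a fixed length it shows up as a smaller SEE, which is the expectation stated above for a fixed-length NR CAT.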

For the situations presented above (testlets; administration of items that do not 
contain a correct response, such as demographic items; innovative computerized item 
formats; or items that contain educational diagnostic information), it appears that the NR 
CAT may be a viable CAT option. Given the exploratory results, the use of category 
information for item selection needs to be more systematically investigated. The use of 
category information for item selection may prove useful in certain situations. 

Table 1: Mean θ̂, standard deviation of θ̂ (SD), and r^a 

                                          Test Length
m    Imax    Stat            10          15          20          25          30

2    0.25    mean         0.021       0.010       0.002      -0.002      -0.003
             SD           1.936       1.921       1.906       1.898       1.900
             r            0.935       0.956       0.967 [illegible] [illegible]
     0.16    mean         0.027       0.007       0.001      -0.009      -0.009
             SD           1.949       1.927       1.923       1.918       1.914
             r            0.910       0.938       0.954       0.962       0.968
     0.09    mean         0.052       0.006      -0.003      -0.140      -0.003
             SD           1.952       1.948       1.948       1.950       1.937
             r            0.863       0.905       0.926       0.939       0.949
     0.04    mean         0.068       0.061       0.020       0.014       0.009
             SD           1.875       1.908       1.932       1.945       1.956
             r            0.759       0.818       0.855       0.880       0.900

3    0.25    mean        -0.003       0.003       0.000       0.004       0.001
             SD           1.951       1.936       1.929       1.924       3.670
             r            0.936       0.958       0.968       0.974 [illegible]
     0.16    mean        -0.003      -0.014      -0.004       0.010       0.009
             SD           1.959       1.951       1.952       1.942       1.938
             r            0.918       0.939       0.956       0.964       0.971
     0.09    mean        -0.004      -0.006      -0.009      -0.008       0.000
             SD           1.965       1.963       1.958       1.951       1.950
             r            0.863       0.903       0.929       0.941       0.950
     0.04    mean       -0.0200       0.000       0.009       0.015       0.003
             SD           1.881       1.922       1.939       1.950       1.954
             r            0.763       0.831       0.868       0.890       0.907

4    0.25    mean        -0.007      -0.008      -0.013      -0.014      -0.016
             SD           1.969       1.951       1.941       1.943       1.934
             r            0.942       0.960       0.969       0.974       0.977
     0.16    mean        -0.034      -0.025      -0.028      -0.031      -0.035
             SD           1.979       1.974       1.973       1.960       1.961
             r            0.912       0.939       0.951       0.959       0.964
     0.09    mean        -0.015      -0.007      -0.006      -0.016      -0.017
             SD           1.978       1.979       1.975       1.986       1.985
             r            0.855       0.902       0.925       0.938       0.945
     0.04    mean        -0.034      -0.008      -0.001      -0.002      -0.009
             SD           1.902       1.941       1.963       1.976       1.982
             r            0.752       0.816       0.847       0.876       0.892

^a Pearson product-moment correlation coefficients between θ̂ and θ_T. 
Mean θ_T = 0.000 and S_θT = 1.872. 
Table 2a: Accuracy analysis for NR CAT: two category condition. 

RMSE 

Source                                 SS      df        MS            F        p
Between Subjects
  Imax                             10.591       3     3.530     167.500*    0.000
  Subjects w/i Groups               1.012      48     0.021
Within Subjects
  Test Length                       4.237       4     1.059     553.451*    0.000
  Imax X Test Length                0.121      12     0.010       5.288     0.000
  Test Length X Subjects
    w/i Groups                      0.367     192     0.002

Post Hoc Comparison Fs for Imax: 

                                        Test Length
Comparison                 10          15          20          25          30
μ0.25 vs μ0.16         17.749*     13.005       8.998       7.270       6.561
μ0.16 vs μ0.09         41.007*     30.926*     28.357*     24.213*     20.397*
μ0.09 vs μ0.04        101.553*    104.288*     92.265*     79.169*     66.524*

Post Hoc Comparison Fs for test length: 

                                     Imax
Comparison               0.04        0.09        0.16        0.25
μ30 vs μ25             19.269*      9.831       6.009       4.943
μ25 vs μ20             24.599*     14.157*      9.477       6.581
μ20 vs μ15             45.253*     32.500*     28.109*     18.281*
μ15 vs μ10             85.293*     89.557*     64.613*     49.169*

Bias 

Source                                 SS      df        MS            F        p
Between Subjects
  Imax                              0.041       3     0.014       0.110     0.954
  Subjects w/i Groups               5.964      48     0.124
Within Subjects
  Test Length                       0.074       4     0.018       2.237     0.067
  Imax X Test Length                0.017      12     0.001       0.176     0.999
  Test Length X Subjects
    w/i Groups                      1.580     192     0.008

*significant at overall α = 0.05, critical F = 13.260 (α = 0.008 per test). 

Average RMSE (2 categories): 

                          Test Length
Imax          10        15        20        25        30
0.04       1.298     1.136     1.018     0.931     0.854
0.09       0.999     0.833     0.733     0.667     0.612
0.16       0.809     0.668     0.575     0.521     0.478
0.25       0.684     0.561     0.486     0.441     0.402
Table 2b: Accuracy analysis for NR CAT: three category condition. 

RMSE 

Source                                 SS      df        MS            F        p
Between Subjects
  Imax                              9.492       3     3.164     135.396*    0.000
  Subjects w/i Groups               1.122      48     0.023
Within Subjects
  Test Length                       4.326       4     1.081     495.580*    0.000
  Imax X Test Length                0.168      12     0.014       6.407     0.000
  Test Length X Subjects
    w/i Groups                      0.419     192     0.002

Post Hoc Comparison Fs for Imax: 

                                        Test Length
Comparison                 10          15          20          25          30
μ0.25 vs μ0.16          7.487      13.622*      7.400       5.694       3.891
μ0.16 vs μ0.09         52.625*     29.602*     24.008*     21.579*     21.284*
μ0.09 vs μ0.04         82.226*     64.287*     62.766*     54.019*     46.795*

Post Hoc Comparison Fs for test length: 

                                     Imax
Comparison               0.04        0.09        0.16        0.25
μ30 vs μ25             14.224*      8.163       7.840       4.232
μ25 vs μ20             22.495*     13.796*     10.609       6.322
μ20 vs μ15             48.601*     46.240*     33.972*     17.881*
μ15 vs μ10            119.122*     81.515*     33.309*     56.036*

Bias 

Source                                 SS      df        MS            F        p
Between Subjects
  Imax                              0.002       3     0.001       0.007     0.999
  Subjects w/i Groups               5.114      48     0.107
Within Subjects
  Test Length                       0.005       4     0.0013      0.157     0.960
  Imax X Test Length                0.010      12     0.0008      0.091     1.000
  Test Length X Subjects
    w/i Groups                      1.677     192     0.0087

*significant at overall α = 0.05, critical F = 13.260 (α = 0.008 per test). 

Average RMSE (3 categories): 

                          Test Length
Imax          10        15        20        25        30
0.04       1.286     1.095     0.973     0.890     0.824
0.09       1.001     0.843     0.724     0.659     0.609
0.16       0.773     0.672     0.570     0.513     0.464
0.25       0.687     0.556     0.482     0.438     0.402
Table 2c: Accuracy analysis for NR CAT: four category condition. 

RMSE 

Source                                 SS      df        MS            F        p
Between Subjects
  Imax                             11.713       3     3.904  [illegible]    0.000
  Subjects w/i Groups         [illegible]      48     0.029
Within Subjects
  Test Length                       3.731       4     0.933     556.861*    0.000
  Imax X Test Length                0.177      12     0.015       8.826     0.000
  Test Length X Subjects
    w/i Groups                      0.322     192     0.002

Post Hoc Comparison Fs for Imax: 

                                        Test Length
Comparison                 10          15          20          25          30
μ0.25 vs μ0.16         20.337*     15.961*     15.008*     11.905      10.488
μ0.16 vs μ0.09         45.553*     29.024*     18.212*     15.720*     15.481*
μ0.09 vs μ0.04         75.979*     79.718*     86.898*     67.274*     54.091*

Post Hoc Comparison Fs for test length: 

                                     Imax
Comparison               0.04        0.09        0.16        0.25
μ30 vs μ25             13.375*      4.232       4.000       2.560
μ25 vs μ20             33.309*     13.375*      9.522       5.224
μ20 vs μ15             28.242*     36.689*     15.546*     13.796*
μ15 vs μ10             94.357*    102.299*     56.895*     43.184*

Bias 

Source                                 SS      df        MS            F        p
Between Subjects
  Imax                              0.018       3     0.006       0.059     0.981
  Subjects w/i Groups               4.888      48     0.102
Within Subjects
  Test Length                       0.004       4     0.0010      0.134     0.970
  Imax X Test Length                0.008      12     0.0007      0.078     1.000
  Test Length X Subjects
    w/i Groups                      1.599     192     0.0083

*significant at overall α = 0.05, critical F = 13.260 (α = 0.008 per test). 

Average RMSE (4 categories): 

                          Test Length
Imax          10        15        20        25        30
0.04       1.321     1.151     1.058     0.957     0.893
0.09       1.033     0.856     0.750     0.686     0.650
0.16       0.810     0.678     0.609     0.555     0.520
0.25       0.661     0.546     0.481     0.441     0.413
Table 3: Descriptive statistics for NR and 3PL CATs. Item selection on the basis of item 
information for both NR and 3PL CATs. 

CAT    SEE     Mean θ̂    SD θ̂    Mean NIA^a   Median NIA^a   SD NIA^a     r^b
3PL    0.30     0.168     1.193      12.759        10.000       5.927    0.902
       0.25     0.152     1.165      15.073        13.000       6.335    0.925
       0.20     0.171     1.164      16.191        13.000       6.879    0.928
NR     0.30     0.275     1.200       9.682         8.000       5.871    0.926
       0.25     0.267     1.190      10.763         9.000       6.472    0.926
       0.20     0.269     1.186      12.393        10.000       6.532    0.929

^a Number of items administered. 
^b Spearman rank-order correlation coefficients between θ̂ and θ_T. 

Note: Mean θ_T = 0.000, S_θT = 1.292. 
Table 4: Accuracy analysis for NR and 3PL CATs. Item selection on the basis of item 
information for both NR and 3PL CATs. 

RMSE 

Source                                 SS      df        MS            F        p
Between Subjects
  CAT Type                          0.054       1     0.054       0.681     0.421
  Subjects w/i Groups               1.267      16     0.079
Within Subjects
  SEE Term                          0.022       2     0.011      12.584*    0.000
  CAT Type X SEE Term               0.004       2     0.002       2.527     0.096
  SEE Term X Subjects
    w/i Groups                      0.028      32     0.001

Bias 

Source                                 SS      df        MS            F        p
Between Subjects
  CAT Type                          0.154       1     0.154       0.661     0.428
  Subjects w/i Groups               3.736      16     0.234
Within Subjects
  SEE Term                          0.001       2     0.0005      1.492     0.240
  CAT Type X SEE Term               0.001       2     0.0005      0.763     0.475
  SEE Term X Subjects
    w/i Groups                      0.014      32     0.0004

NIA 

Source                                 SS      df        MS            F        p
Between Subjects
  CAT Type                        187.638       1   187.638       8.068     0.012
  Subjects w/i Groups             372.095      16    23.256
Within Subjects
  SEE Term                         85.231       2    42.615      76.371*    0.000
  CAT Type X SEE Term               3.455       2     1.728       3.096     0.059
  SEE Term X Subjects
    w/i Groups                     17.856      32     0.558

*significant at overall α = 0.05, critical F = 10.223 (α = 0.0056 per test). 
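
The F ratios in the NIA panel of Table 4 can be recomputed from the tabled sums of squares and degrees of freedom. The sketch below assumes the usual split-plot arrangement: the between-subjects effect (CAT Type) is tested against Subjects w/i Groups, and the within-subjects effects are tested against SEE Term X Subjects w/i Groups. The variable names are hypothetical, and small differences from the tabled values reflect rounding in the printed table.

```python
def mean_square(ss, df):
    """Mean square is the sum of squares divided by its degrees of freedom."""
    return ss / df

# NIA panel of Table 4: (sum of squares, degrees of freedom).
cat_type = (187.638, 1)
subjects_within = (372.095, 16)      # error term for the between-subjects effect
see_term = (85.231, 2)
cat_by_see = (3.455, 2)
see_by_subjects = (17.856, 32)       # error term for the within-subjects effects

f_cat_type = mean_square(*cat_type) / mean_square(*subjects_within)
f_see_term = mean_square(*see_term) / mean_square(*see_by_subjects)
f_interaction = mean_square(*cat_by_see) / mean_square(*see_by_subjects)

print(round(f_cat_type, 3), round(f_see_term, 3), round(f_interaction, 3))
```

The recomputed CAT Type ratio falls below the tabled critical F of 10.223, in line with the nonsignificant NIA difference reported for item-information selection.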
Table 5: Accuracy analysis for NR and 3PL CATs. Item selection on the basis of category 
information for NR CAT and via item information for 3PL CAT. 

RMSE 

Source                                 SS      df        MS            F        p
Between Subjects
  CAT Type                          0.018       1     0.018       0.203     0.658
  Subjects w/i Groups               1.450      16     0.091
Within Subjects
  SEE Term                          0.035       2     0.017       9.023     0.001
  CAT Type X SEE Term               0.008       2     0.004       2.026     0.148
  SEE Term X Subjects
    w/i Groups                      0.062      32     0.002

Bias 

Source                                 SS      df        MS            F        p
Between Subjects
  CAT Type                          0.196       1     0.196       0.767     0.394
  Subjects w/i Groups               4.085      16     0.255
Within Subjects
  SEE Term                          0.004       2     0.002       1.206     0.313
  CAT Type X SEE Term               0.007       2     0.004       2.416     0.105
  SEE Term X Subjects
    w/i Groups                      0.048      32     0.001

NIA 

Source                                 SS      df        MS            F        p
Between Subjects
  CAT Type                        335.653       1   335.653      13.883*    0.002
  Subjects w/i Groups             386.833      16    24.177
Within Subjects
  SEE Term                        102.072       2    51.036      74.531*    0.000
  CAT Type X SEE Term               2.123       2     1.062       1.550     0.228
  SEE Term X Subjects
    w/i Groups                     21.912      32     0.685
Table 6: Descriptive statistics for NR and 3PL CATs. Item selection on the basis of 
category information for the NR CAT and item information for the 3PL CAT. 

CAT    SEE     Mean θ̂    SD θ̂    Mean NIA^a   Median NIA^a   SD NIA^a     r^b
3PL    0.30     0.168     1.193      12.759        10.000       5.927    0.902
       0.25     0.152     1.165      15.073        13.000       6.335    0.925
       0.20     0.171     1.164      16.191        13.000       6.879    0.928
NR     0.30     0.302     1.157       8.121         6.000       4.956    0.916
       0.25     0.292     1.170       9.532         8.000       5.116    0.918
       0.20     0.259     1.180      11.411        10.000       6.195    0.924

^a Number of items administered. 
^b Spearman rank-order correlation coefficients between θ̂ and θ_T. 

Note: Mean θ_T = 0.000, S_θT = 1.292. 
Figure Captions 

Figure 1. Multivariate logit plot for a three-category item, a = (-0.75, -0.25, 1.0) and 
c = (-1.5, -0.25, 1.75), in the category selected and logit spaces. 

Figure 2. Example OCCs for a three-category item, a = (-0.75, -0.25, 1.0) and c = (-1.5, 
-0.25, 1.75). 

Figure 3. NR model's OCCs (a2 = 0.40, a1 = -0.40, c2 = 0.2, and c1 = -0.20) and the 2PL ICC 
(a = 0.80 and b = -0.5). 

Figure 4. Bimodal information function for an item where a = (1, 0.1, -0.1, -1) and c = 
(0.1, 2.4, -2.6, 0.1). 

Figure 5a. RMSE plot for NR CAT (mj = 3, NIA = 20). 

Figure 5b. Bias plot for NR CAT (mj = 3, NIA = 20). 

Figure 6. Average NIA for NR CAT minus average NIA for 3PL CAT. 



[Figure legend entries: Category 1, Category 2, Category 3; vertical axis: Value; horizontal axis: θ from -4 to 4] 
[Figure legend entries: NR-category 1, NR-category 2, 2PL IRF] 
[Figure panel label: RMSE - 0.04] 